Final: Effects of Air Pollution on Countries
1. Introduction
Motivation
Climate issues are pervasive but typically disproportionately affect low income communities and developing countries. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s economic and social position can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?
Set Up
Before we start, we need to ensure that we have all the relevant libraries installed and imported.
Run these in the console, or only the ones that your system does not have, to install packages in addition to the ezids package.
install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")
2. Data Sources and Data Wrangling
Data Sources
For our analysis, we will be working with 5 main data sources shown in the table below:
| Data | Source | Link |
|---|---|---|
| Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle | Link |
| GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank | Link |
| United Nations Population and Region Data | United Nations | Link |
| United Nations ISO-alpha3 code | United Nations | Link |
| spData for Map Geometries | spData for Mapping | Link |
The main variables in our datasets will include:
| Feature | Data Type | Unit of Measure | Notes and Assumptions |
|---|---|---|---|
| GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status |
| Population Size | Numerical, Continuous | thousands of people | Annual UN estimated |
| Deaths due to Air Pollution | Numerical, Continuous | deaths per million | This is our chosen proxy for measuring the negative affects of air pollution. |
| Country | Qualitative, Categorical | N/A | 231 countries |
| SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification. |
| Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification. |
| ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID). |
| ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID). |
| M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID). |
| Year | Numerical, Categorical | N/A | 1990 to 2017 |
| GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated). |
Data Wrangling
While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.
- Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
- Figure 4: Sample screenshot of transformed UN dataset.
3. Load, Clean, and Inspect Data
Load Data
| variable | class | first_values |
|---|---|---|
| Country.or.Area | character | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | character | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | character | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
| variable | class | first_values |
|---|---|---|
| Entity | character | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| Code | character | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Air.pollution..total…deaths.per.100.000. | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
| Indoor.air.pollution..deaths.per.100.000. | double | 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878 |
| Outdoor.particulate.matter..deaths.per.100.000. | double | 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974 |
| Outdoor.ozone.pollution..deaths.per.100.000. | double | 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707 |
| variable | class | first_values |
|---|---|---|
| Country.Name | character | Aruba, Afghanistan, Angola, Albania, Andorra, Arab World |
| Country.Code | character | ABW, AFG, AGO, ALB, AND, ARB |
| Indicator.Name | character | GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)) |
| Indicator.Code | character | NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD |
| X1960 | double | NA, 537777811.111111, NA, NA, NA, NA |
| X1961 | double | NA, 548888895.555556, NA, NA, NA, NA |
| X1962 | double | NA, 546666677.777778, NA, NA, NA, NA |
| X1963 | double | NA, 751111191.111111, NA, NA, NA, NA |
| X1964 | double | NA, 800000044.444444, NA, NA, NA, NA |
| X1965 | double | NA, 1006666637.77778, NA, NA, NA, NA |
| variable | class | first_values |
|---|---|---|
| SDGRegion | character | SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA |
| SubRegion | character | Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa |
| Country | character | Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya |
| Notes | integer | NA, NA, NA, NA, NA, NA |
| Country.code | integer | 108, 174, 262, 232, 231, 404 |
| Type | character | Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area |
| Parent.code | integer | 910, 910, 910, 910, 910, 910 |
| X1950 | character | 2 309, 159, 62, 822, 18 128, 6 077 |
| X1951 | character | 2 360, 163, 63, 835, 18 467, 6 242 |
| X1952 | character | 2 406, 167, 65, 849, 18 820, 6 416 |
| variable | class | first_values |
|---|---|---|
| iso_a2 | character | FJ, TZ, EH, CA, US, KZ |
| name_long | character | Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan |
| continent | character | Oceania, Africa, Africa, North America, North America, Asia |
| region_un | character | Oceania, Africa, Africa, Americas, Americas, Asia |
| subregion | character | Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia |
| type | character | Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country |
| area_km2 | double | 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781 |
| pop | double | 885806, 52234869, NA, 35535348, 318622525, 17288285 |
| lifeExp | double | 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62 |
| gdpPercap | double | 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466 |
Clean Data
First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).
Clean air_pollution_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| ISO.alpha3.code | integer | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Deaths.Air.Pollution.per.100k | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
Clean gdp_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Aruba, Aruba, Aruba, Aruba, Aruba, Aruba |
| ISO.alpha3.code | integer | ABW, ABW, ABW, ABW, ABW, ABW |
| Year | integer | 1986, 1987, 1988, 1989, 1990, 1991 |
| GDP.USD | double | 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799 |
Clean population_region_df:
| variable | class | first_values |
|---|---|---|
| SDGRegion | integer | SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA |
| SubRegion | integer | EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica |
| Country | integer | Burundi, Burundi, Burundi, Burundi, Burundi, Burundi |
| M49.code | integer | 108, 108, 108, 108, 108, 108 |
| Year | integer | 1950, 1951, 1952, 1953, 1954, 1955 |
| Population.thousands | double | 2309, 2360, 2406, 2449, 2492, 2537 |
Clean population_region_df:
| variable | class | first_values |
|---|---|---|
| Country.or.Area | integer | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | integer | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | integer | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
Clean world:
| variable | class | first_values |
|---|---|---|
| iso_a2 | integer | FJ, TZ, EH, CA, US, KZ |
Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.
Final DataFrame Construction
Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).
| variable | class | first_values |
|---|---|---|
| ISO.alpha2.code | integer | AD, AD, AD, AD, AD, AD |
| M49.code | integer | 20, 20, 20, 20, 20, 20 |
| Year | integer | 2012, 2013, 1990, 1991, 1992, 1993 |
| ISO.alpha3.code | integer | AND, AND, AND, AND, AND, AND |
| Country.x | integer | Andorra, Andorra, Andorra, Andorra, Andorra, Andorra |
| Deaths.Air.Pollution.per.100k | double | 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189 |
| GDP.USD | double | 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065 |
| SDGRegion | integer | EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE |
| SubRegion | integer | SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope |
| Population.thousands | double | 82, 81, 55, 57, 59, 61 |
| geom | list | list(), list(), list(), list(), list(), list() |
| gdp.per.capita | double | 38887.9139337455, 39428.4486815589, 18709.9723978275, 19419.7996994086, 20508.7059640192, 16508.6189344369 |
Our dataset is finally ready to be analyzed.
4. EDA Summary
All necessary EDA was performed in the Midterm Report where we looked at distributions for each key numerical variable in Air Pollution Induced Deaths per 100k, Population, GDP per Capita. We also looked into boxplots of these key numerical variables by Regions and Sub Regions. Interestingly, we observed that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively.
Next, we checked for outliers and found that Deaths.Air.Pollution.per.100k definitely do not need to have outliers removed as it only represents 0.7% of the dataset. gdp.per.capita have higher percentage of classified outliers at 13.5%, however, we felt that keeping the extreme datapoints of this feature was important in our research and analysis to capture the true disproportionate distribution of wealth and progress across countries in our world.
Lastly, we explored a few choropleth maps to visually understand our dataset better.
We intentionally did not include any figures from EDA in this final report to preserve space. Please look at the Midterm Report for details on our entire EDA process.
5. Main Research Question
Below is a rehash of our main research question from the Midterm report and we felt that the findings from this model was important enough to warrant inclusion in this final report as well.
Do lower GDP countries have more deaths per 100k due to air pollution?
Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?
Linear Fit
Let’s first look at the general fit on the overall data.
- Figure 15: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.
From the plot, we observed that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.295. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.
Transformed Log Scale - Linear Fit
We then performed a log transformation and found that our linear regression fits much better.
fit2’s summary statistics are:
##
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
## data = final_df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.099 -0.235 0.000 0.206 1.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.38778 0.02663 277 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952 0.00323 -121 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.737
## F-statistic: 1.45e+04 on 1 and 5195 DF, p-value: <0.0000000000000002
We re-plotted the linear model and here were the results.
- Figure 16: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.
Across the board, the strength of our linear relationship increases dramatically when first transforming both features by the log() function first. The new R2 is now 0.737 which means around 74% of the variance in our target feature can be explained by this mathematical relationship.
We can then predict a country’s deaths caused from air pollution in a given year by using the country’s GDP per capita with the following equation:
\[ log(Deaths_{from~air~pollution|per~year|per~country} / 100,000) = 7.38778 - 0.38952 * log(GDP_{per capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]
or solving for our target variable:
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]
Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?
We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test if means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
One-Way ANOVA Test
We started off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal
We will use an \(\alpha\) value of 0.05.
## $`1`
## [1] 95.2 920.3
##
## $`2`
## [1] 921 3100
##
## $`3`
## [1] 3104 11493
##
## $`4`
## [1] 11497 119106
## Df Sum Sq Mean Sq F value Pr(>F)
## qnt 3 10735714 3578571 3133 <0.0000000000000002 ***
## Residuals 5193 5932162 1142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-valuetest1 is 0e+00, which is lower than \(\alpha\)0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means that there is statistically significant that at least one of the means of deaths in low, medium, and high GDP per capita are not the same.
2-Sample T-Tests
We will conduct 6 2-sample t-tests to determine if each of the groupings are different from each other:
- Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
- H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
- H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
- Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
- Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
- Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_medium_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_medium_gdp
- Low GDP per capita’s deaths does not equal Highest GDP per capita’s deaths
- H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_high_gdp
We used a two sample t-test for each and used an \(\alpha\) value of 0.05.
Test 1:
Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.
Test 2:
Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 3:
Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 4:
Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 5:
Conclusion of test5: p-valuetest5 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 6:
Conclusion of test6: p-valuetest6 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Midterm Main Research Results
From all of our tests above, we can confirm that the means of deaths caused by air pollution are statistically significant when grouped by different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution has a significant relationship with GDP per capita and the model can be quantified by Equation 2:
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]
The strength of the correlation can be quantified by our R2 value of 0.737 from Figure 16.
6. Smart Questions for Further Modeling
1. What are the impacts of population size from GDP and Deaths due to Air Pollution globally?
- Categorizing GDP into low (0), medium(1) and high(2)
SQ6 <- final_df
library(dplyr)
groupings_pop <- SQ6 %>% mutate(population_grouping = case_when(Population.thousands >= 100001 & Population.thousands <= 1000000000 ~ 2,
Population.thousands >= 50001 & Population.thousands <= 100001 ~ 1,
Population.thousands >= 0 & Population.thousands <= 50000 ~ 0)) # end function- Building a logit model GDP as predictor on Population
pop_gdpLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita, data = groupings_pop)summary(pop_gdpLogit)##
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita,
## data = groupings_pop)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.246 -0.185 -0.181 -0.180 1.820
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.180177461 0.008492388 21.22
## groupings_pop$gdp.per.capita 0.000000553 0.000000456 1.21
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## groupings_pop$gdp.per.capita 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.267)
##
## Null deviance: 1389.8 on 5196 degrees of freedom
## Residual deviance: 1389.4 on 5195 degrees of freedom
## AIC: 7899
##
## Number of Fisher Scoring iterations: 2
- Bulding a logit model GDP and Deaths due to Air Population as predictor on Population
pop_gdp_deathsLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita + groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)summary(pop_gdp_deathsLogit)##
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita +
## groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.215 -0.194 -0.186 -0.169 1.842
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.202910224 0.018188095 11.16
## groupings_pop$gdp.per.capita 0.000000136 0.000000543 0.25
## groupings_pop$Deaths.Air.Pollution.per.100k -0.000213150 0.000150811 -1.41
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## groupings_pop$gdp.per.capita 0.80
## groupings_pop$Deaths.Air.Pollution.per.100k 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.267)
##
## Null deviance: 1389.8 on 5196 degrees of freedom
## Residual deviance: 1388.9 on 5194 degrees of freedom
## AIC: 7899
##
## Number of Fisher Scoring iterations: 2
2. What are the effects of (low or high) GDP and population size on deaths due to air pollution in Sub-Saharan Africa?
3. Can we let the data tell us what type of groupings exist in our dataset? How consistent are they to our preconceived groupings such as region or developed vs developing countries?
We wanted to better understand what types of groupings exist within our dataset. More concretely, we wanted to put aside our preconceived presumptions about our dataset and create a K-Means clustering model and allow the the data to guide us to the possible clusters that may exist within our data.
K-Means clustering is an unsupervised machine learning algorithm that is very useful to parse the dataset and identify K number of groups where observations within each group have high similarity.
First, let’s create an index label.
| index_ | region | Deaths.Air.Pollution.per.100k | gdp.per.capita | Population.thousands | |
|---|---|---|---|---|---|
| Australia | Australia | AUSTRALIA/NEWZEALAND | 17.8 | 35412 | 20283 |
| New Zealand | New Zealand | AUSTRALIA/NEWZEALAND | 15.9 | 25279 | 4057 |
| Afghanistan | Afghanistan | CENTRALANDSOUTHERNASIA | 227.8 | 430 | 29282 |
| Bangladesh | Bangladesh | CENTRALANDSOUTHERNASIA | 144.5 | 620 | 133736 |
| Bhutan | Bhutan | CENTRALANDSOUTHERNASIA | 115.9 | 1408 | 627 |
| India | India | CENTRALANDSOUTHERNASIA | 165.9 | 838 | 1115445 |
Next, let’s select only the numerical values from our dataset.
| variable | class | first_values |
|---|---|---|
| Deaths.Air.Pollution.per.100k | double | 17.768147178761, 15.9253637864585, 227.765254524006, 144.513734731741, 115.896472640701, 165.913668178177 |
| gdp.per.capita | double | 35411.5137067032, 25278.8349742491, 430.287787933406, 619.590316576761, 1408.24273427377, 838.320549169855 |
| Population.thousands | double | 20282.8928571429, 4056.67857142857, 29282, 133735.607142857, 627, 1115445.42857143 |
Optimal K Clusters
Let’s try to find the optimal K value to achieve best clustering results. The 2 that we will try are using:
- Elbow Method with Total Within Sum of Squares
- Looking at the Gap Statistics
Elbow Method with Total Within Sum of Squares
- Figure 6.3-2: Finding the optimal number of clusters using the Within Sum of Squares Method.
From the total Within Sum of Squares plot, we find that the optimal number of K seems to be at 4 using the Elbow method which identifies the K number where the Total Sum of Squares begins to level off, or where adding additional K clusters only improves our model marginally.
Looking at the Gap Statistics
- Figure 6.3-3: Finding the optimal number of clusters using the Gap Statistic.
From the Gap Statistic plot, we find that the optimal number of K seems to be at 10 with the highest gap statistic, however, we observed using the Elbow method that the Total Sum of Squares begins to level off from 4 to 10. We will stick with using K = 4.
Perform K-Means Clustering with Optimal K
We perform a K-Means clustering analysis with a K of 4 and here are the results below.
| Cluster | Deaths.Air.Pollution.per.100k | gdp.per.capita | Population.thousands |
|---|---|---|---|
| 1 | 1.16234711051454 | -0.61738173258503 | -0.101796957671015 |
| 2 | 1.01091758337588 | -0.564333841794921 | 9.25163623841697 |
| 3 | -1.04092828247631 | 1.92017020504228 | -0.0578237235004342 |
| 4 | -0.494528110744794 | -0.253250480051675 | -0.107965219042902 |
| Cluster | Observations with Each Cluster |
|---|---|
| 1 | 67 |
| 2 | 2 |
| 3 | 34 |
| 4 | 90 |
| Cluster | Total Sum of Squares with Each Cluster |
|---|---|
| 1 | 30.88 |
| 2 | 1.52 |
| 3 | 34.88 |
| 4 | 26.96 |
Let’s take a look at clusters visually. Interestingly, our model has grouped China and India into their own separate cluster, Cluster 2. Clusters 4, 3, and 1 can be roughly characterized by high GDP per captita nations, medium GDP per capita nations, and low GDP per capita nations’.
- Figure 3.6-7: Visualing K-Means Clustering Model.
Let’s take a closer look at each cluster’s breakdown of subregions more closely to determine what type of discrepancies we find between each cluster.
- Figure 6.3-8: Frequency % of SDGRegion by Clusters.
Interestingly, we observe that Cluster 1 is disproportionately comprised of countries in Sub-Saharan Africa with large representations from Oceania, Eastern & South-Eastern Asia, and Central & Southern Asia. Cluster 2 is the one cluster comprised solely of China and India. Cluster 3 has a much larger proportion of countries from Europe, Northen America, and Northern Africa & Western Asia. Cluster 4 is similar to Cluster 3 except it does not have countries from Northern America and has more countries from Sub-Saharan Africa and Central & Southern Asia.
4. Can we make a prediction of the future GDP by considering the population size, location, and the reation of population and deaths due to air pollution?
For our major model we make a nice prediction for GDP through total deaths caused by air pollution, so as we can see that through the distribution of the data, there are strong relationships between total deaths caused by air pollution, deaths caused by indoor air pollution, and deaths caused by outdoor air pollution.
# select what we need
# total
#ggplot(total3, aes(x=Year, y=air_pollution_total)) +
# geom_bar(stat='identity', position='dodge')
# indoor
#ggplot(total3, aes(x=Year, y=air_pollution_indoor)) +
# geom_bar(stat='identity', position='dodge')
# outdoor particulate
#ggplot(total3, aes(x=Year, y=air_pollution_outdoorP)) +
# geom_bar(stat='identity', position='dodge')
# outdoor ozone
#ggplot(total3, aes(x=Year, y=air_pollution_outdoorO)) +
# geom_bar(stat='identity', position='dodge')
air_pollution_df_cleaned1 <- air_pollution_dfSo it seems like we can also get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.
# make Entity and year factor
air_pollution_df_cleaned1$Entity = factor(air_pollution_df_cleaned1$Entity)
air_pollution_df_cleaned1$Code = factor(air_pollution_df_cleaned1$Code)
air_pollution_df_cleaned1$Year = factor(air_pollution_df_cleaned1$Year)# polution
air_pollution_df_cleaned1 <- rename(air_pollution_df_cleaned1, Country = Entity, ISO.alpha3.code = Code, Deaths.Air.Pollution.per.100k = Air.pollution..total...deaths.per.100.000. , Deaths.Air.Pollution.Indoor.per.100k = Indoor.air.pollution..deaths.per.100.000., Deaths.Air.Pollution.OutdoorP.per.100k = Outdoor.particulate.matter..deaths.per.100.000., Deaths.Air.Pollution.OutdoorO.per.100k = Outdoor.ozone.pollution..deaths.per.100.000.)
# population
#population_region_df_cleaned <- subset(population_region_df, select = -c(Notes, Type, Parent.code))
#population_region_df_cleaned <- population_region_df_cleaned %>%
# pivot_longer(
# cols = starts_with("X"),
# names_to = "Year",
# values_to = "Population.thousands",
# values_drop_na = TRUE
# )
# population_region_df_cleaned$Year<-gsub("X","",as.character(population_region_df_cleaned$Year))
# population_region_df_cleaned <- as.data.frame(apply(population_region_df_cleaned, 2, function(x) gsub("\\s+", "", x)))
# population_region_df_cleaned$Country = factor(population_region_df_cleaned$Country)
# population_region_df_cleaned$SDGRegion = factor(population_region_df_cleaned$嚜燙DGRegion)
# population_region_df_cleaned$Country.code = factor(population_region_df_cleaned$Country.code)
# population_region_df_cleaned$SubRegion = factor(population_region_df_cleaned$SubRegion)
# population_region_df_cleaned$Year = factor(population_region_df_cleaned$Year)
# population_region_df_cleaned$Population.thousands = as.numeric(population_region_df_cleaned$Population.thousands)
# population_region_df_cleaned <- rename(population_region_df_cleaned, M49.code = Country.code)
# country_codes_df$M49.code = factor(country_codes_df$M49.code)
# country_codes_df$Country.or.Area = factor(country_codes_df$嚜澧ountry.or.Area)
# country_codes_df$ISO.alpha3.code = factor(country_codes_df$ISO.alpha3.code)
# country_codes_df$ISO.alpha2.code = factor(country_codes_df$ISO.alpha2.code)
# world$iso_a2 = factor(world$iso_a2)
#
# world_df_cleaned <- subset(world, select = c(iso_a2, geom))
# merge
final_dfc <- merge(x = air_pollution_df_cleaned1, y = country_codes_df, by = 'ISO.alpha3.code')
final_dfc <- merge(x = final_dfc, y = gdp_df_cleaned, by = c('ISO.alpha3.code', 'Year'))
final_dfc <- merge(x = final_dfc, y = population_region_df_cleaned, by = c('M49.code', 'Year'))
final_dfc <- merge(x = final_dfc, y = world_df_cleaned, by.x = 'ISO.alpha2.code', by.y = 'iso_a2', all.x = TRUE) #left join
# Remove columns
final_dfc <- subset(final_dfc, select = -c(Country.or.Area, Country.y, Country))
# Calculate GDP per capita
final_dfc$gdp.per.capita <- final_dfc$GDP.USD / final_dfc$Population.thousands
final_dfc$Deaths.Air.Pollution.Outdoor.Total.per.100k <- final_dfc$Deaths.Air.Pollution.OutdoorO.per.100k + final_dfc$Deaths.Air.Pollution.OutdoorP.per.100kFor the first model, we build a linear model using ‘gdp.per.capita’ and ‘Deaths.Air.Pollution.Indoor.per.100k’.
final_df_sfc = st_as_sf(final_dfc)
fit2c <- lm(gdp.per.capita ~Deaths.Air.Pollution.Indoor.per.100k, data=final_df_sfc)
summary(fit2c)##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k,
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15508268 -8789754 -3652222 3500260 102785447
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 16339920 254681 64.2
## Deaths.Air.Pollution.Indoor.per.100k -126647 3308 -38.3
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Deaths.Air.Pollution.Indoor.per.100k <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13900000 on 5195 degrees of freedom
## Multiple R-squared: 0.22, Adjusted R-squared: 0.22
## F-statistic: 1.47e+03 on 1 and 5195 DF, p-value: <0.0000000000000002
ggplot(final_df_sfc, aes(gdp.per.capita, Deaths.Air.Pollution.Indoor.per.100k)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
stat_regline_equation(label.y = 350, aes(label = ..eq.label..)) +
stat_regline_equation(label.y = 400, aes(label = ..rr.label..))- Figure 6.4-1: Deaths due to Indoor Air Pollution vs GDP per capita.
The R-squared value of this model is 0.22, which is a really bad result. So follow the steps in the major model building, now we put these two features into a log() and it should performed better.
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Indoor.per.100k),
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.354 -0.599 0.098 0.647 2.892
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 16.33257 0.01763 927
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.54761 0.00517 -106
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.891 on 5195 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.684
## F-statistic: 1.12e+04 on 1 and 5195 DF, p-value: <0.0000000000000002
- Figure 6.4-2: Log Deaths due to Indoor Air Pollution vs Log GDP per capita.
As we can see, although it performed much better with a R-squared value 0.68 but this is still not good as the major model.
Next we try the deaths caused by outdoor air pollution ‘Deaths.Air.Pollution.Outdoor.Total.per.100k’.
##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Outdoor.Total.per.100k,
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14051236 -9537840 -5335898 2518144 106063637
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 15643215 461880 33.9
## Deaths.Air.Pollution.Outdoor.Total.per.100k -150382 10830 -13.9
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15400000 on 5195 degrees of freedom
## Multiple R-squared: 0.0358, Adjusted R-squared: 0.0356
## F-statistic: 193 on 1 and 5195 DF, p-value: <0.0000000000000002
- Figure 6.4-3: Deaths due to Outdoor Air Pollution vs GDP per capita.
The performence of this model is also bad, with a R-squared value 0.036, so we put the features in the log() too.
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k),
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.591 -1.272 -0.027 1.297 3.471
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 15.6988 0.1565 100.28
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.1991 0.0442 -4.51
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) 0.0000068 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.58 on 5195 degrees of freedom
## Multiple R-squared: 0.00389, Adjusted R-squared: 0.0037
## F-statistic: 20.3 on 1 and 5195 DF, p-value: 0.00000675
- Figure 6.4-4: Log Deaths due to Outdoor Air Pollution vs Log GDP per capita.
A surprised point is that although we put the variables into log(), the performance of model is still bad and even worse. The R-squared value of this model is 0.0039.
Finally, we are going to test build a model base on both deaths caused by outdoor and indoor air pollution. To check if this model also works bad.
##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k +
## Deaths.Air.Pollution.Outdoor.Total.per.100k, data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18198746 -7911990 -2886359 3411956 96919910
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 26403084 457576 57.7
## Deaths.Air.Pollution.Indoor.per.100k -144489 3190 -45.3
## Deaths.Air.Pollution.Outdoor.Total.per.100k -242555 9394 -25.8
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Deaths.Air.Pollution.Indoor.per.100k <0.0000000000000002 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13100000 on 5194 degrees of freedom
## Multiple R-squared: 0.309, Adjusted R-squared: 0.309
## F-statistic: 1.16e+03 on 2 and 5194 DF, p-value: <0.0000000000000002
- Figure 6.4-5: Deaths due to Indoor + Outdoor Air Pollution vs GDP per capita.
The R-squared value of this model is 0.29. Next we put these variables into log().
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k) +
## log(Deaths.Air.Pollution.Indoor.per.100k), data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.200 -0.594 0.055 0.621 2.901
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 17.60605 0.08820 199.6
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.35985 0.02444 -14.7
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.55212 0.00507 -108.8
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.873 on 5194 degrees of freedom
## Multiple R-squared: 0.696, Adjusted R-squared: 0.696
## F-statistic: 5.96e+03 on 2 and 5194 DF, p-value: <0.0000000000000002
- Figure 6.4-6: Log Deaths due to Indoor + Outdoor Air Pollution vs Log GDP per capita.
For this model the R-squared value is 0.69. This is actually not a bad result but when comparing to the major model using total deaths caused by air pollution, 0.69 is not good enough for predicting GDP.
We can get the conclusion that we can not get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.
7. Conclusion SMART Modeling
For the Logit Regression, we observed small p-values indicating that all the coefficients are found to be significant. GDP has a positive effect on population, but deaths due to air pollution have a negative effect on population.
As shown above, post-pruning, our accuracy is still 74%. So, pruning had little effect although the tree has become simpler. Therefore, we have a model that can predict high deaths due to air pollution with reasonable (74%) accuracy.
For the clustering, it is really interesting to observe that the main clusters 1, 3, and 4 have large discrepancies in regional representation while the data we used for clustering did not have any explicit geospatial components. This can mean that regionality, although may not be the cause, does have some correlation in determining the groups of factors in GDP per Capita, Population, and Deaths due to Air Pollution per 100k.
Although death caused by indoor air pollution and outdoor air pollution seems to have a strong relationship with total death caused by air pollution, but we can’t accurately predict GDP through death caused by indoor and outdoor air pollution.
8. Bibliography
| Number | APA Citation |
|---|---|
| 1 | Robin Lovelace, J. N. (n.d.). Chapter 8 Making maps with R: Geocomputation with R. Retrieved October 28, 2021, from https://geocompr.robinlovelace.net/adv-map.html |
| 2 | Robin Lovelace, J. N. (2021, October 28). Chapter 2 Geographic data in R: Geocomputation with R. Retrieved from https://geocompr.robinlovelace.net/spatial-class.html#intro-sf |
| 3 | Hadley Wickham, D. N. (2021, October 28). 6 Maps. Retrieved from https://ggplot2-book.org/maps.html |
| 4 | Customizing ggplot2 color and fill scales. (2021, October 28). Retrieved from https://spielmanlab.github.io/introverse/articles/color_fill_scales.html |
| 5 | Logarithmic Functions. (2021, October 28). Retrieved from https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html |
| 6 | K-Means Clustering in R. (2021, December 3). Retrieved from https://www.statology.org/k-means-clustering-in-r/ |